Abstract



This article focuses on handling measurement error in predictor variables using item
response theory (IRT). Measurement error is of great importance in the assessment of theoretical
constructs, such as intelligence or the school climate. Measurement error is modeled by treating
the predictors as unobserved latent variables and using the normal ogive model to describe
the relation between the latent variables and their observed indicator variables. The predictor
variables can be defined at any level of a hierarchical regression model. The predictor variables
are latent but can be measured indirectly by using tests or questionnaires. The observed
responses on these itemized instruments are related to the latent predictors by an item response
theory model. It will be shown that the multilevel model with measurement error in the observed
predictor variables can be estimated in a Bayesian framework using Gibbs sampling. In this
article, handling measurement error via the normal ogive model is compared with alternative
approaches using the classical true score model. Examples using real data are given.

Key words: classical test theory, Gibbs sampler, item response theory, Hierarchical
Linear Models (HLM), Markov Chain Monte Carlo, measurement error, multilevel model,
two-parameter normal ogive model.

Introduction

In much research, and especially in social sciences, measurements are subject 
to measurement error. Examples are educational measurement and attitude measurement. 
Ignoring measurement error often leads to incorrect inferences (see, for example, Cook & 
Campbell, 1979). Most important in assessing measurement error is classifying the type and 
nature of the error and the sources of data which allow modeling of this error. Measurement 
error can be attributed to the method of data collection, to respondent behavior or to properties 
of the instrument. A typical class of errors is the class of systematic errors, or bias. These 
errors, for instance, arise when sampling covers the population of interest unevenly, or when 
treatment and control groups differ prior to treatment in ways that matter for the outcomes 
under study (see, for instance, Rosenbaum, 1995). Another class of errors is the class of
non-systematic errors. These may, for instance, arise through errors in coding and classification
of data. However, measurement errors also include response variation due to the unreliability 
of a measurement instrument. Further, many forms of human response behavior are inherently 
stochastic in nature, and variation stemming from stochastic response behavior will also be
categorized under the heading of measurement error. In this context, Lord and Novick (1968,
chapter 2) adhere to the so-called stochastic subject view, in which it is reasonable to assume that
answers of the subjects depend on small variations in the circumstances of the persons or the test 
taking situation. Accordingly, response variance is the variation in answers to the same question 
when repeatedly administered to the same person. In the present paper, attention is primarily
focused on non-systematic measurement error, and in the sequel the term measurement error
will only signify random error.

There has been a continuing interest in the study of regression models wherein 
the independent variables are measured with error. These models are commonly known as 
measurement error models. The enormous amount of literature on this topic in linear regression 
is summarized by Fuller (1987) and in this framework, measurement error is handled by the
classical additive measurement error model. An example is the classical test theory model
used in educational measurement. Goldstein (1995) extended some of the techniques to handle 
measurement errors in the independent variables in linear models to the multilevel model. 

The classical additive measurement error model is based on assumptions that may not
always be realistic. First, measurement error is supposed to be independent of the predictor
variables. Further, the assumption of homoscedasticity entails equal variance of measurement
errors conditional on different values of the dependent variable, say, the score level of the
test taker in educational measurement. Another problem is that the reliability of measures
is not easily assessed. One could take repeated measurements to obtain an estimate of the
error variance. However, besides the practical difficulties, it is not realistic to assume that the
repeated measures are independent. Second, a suitable population has to be defined because the
definition of reliability is population dependent. To overcome these problems it is assumed that
the variances and covariances of the measurement errors are known, or that suitable estimates exist
(Goldstein, 1995, p. 142). But the estimates of the measurement error variance are generally
imprecise. It is, for instance, well known that coefficient alpha, an estimate of the ratio of the
variance of the true scores to the variance of the observed scores, underestimates the reliability
(Lord & Novick, 1968). An estimate of the reliability is always based on the responses to the
items of a finite sample of persons and therefore a standard error of the estimate is also needed
(Verhelst, 1998). Further, in case of the usual maximum likelihood approach, the ratio of the
error terms’ variances or, alternatively, one or both of the variances ought to be known to identify
the model (Fuller, 1987, pp. 9-11).

In the present paper, attention is focused on another way of handling response 
variance in the independent variables in a multilevel model. The sources of data to perform a 
measurement error analysis are tests or questionnaires consisting of separate items. The idea is 
to assemble these multiple discrete indicators of predictor variables into an item response theory (IRT)
measurement model. In item response theory, measurement error is defined conditionally on
the value of the latent ability. In IRT, measurement error can be defined locally, for instance, as
the posterior variance of the ability parameter given a response pattern. This local definition of
measurement error results in heteroscedasticity: in the Rasch model, for instance, the posterior
variance of the ability parameter given an extreme score is greater than the posterior variance
of the ability parameter given an intermediate score (see, for instance, Hoijtink & Boomsma,
1995, p. 59, Table 4.1). Besides the fact that reliability can be defined conditionally on
the value of the latent variable, IRT offers the possibility of separating the influence of item 
difficulty and ability level, which supports the use of incomplete test administration designs, 
optimal test assembly, computer adaptive testing and test equating. 

Besides IRT, another theme of this article will be Bayesian data analysis. The
formulation of measurement-error problems in the framework of a Bayesian analysis has
recently been developed (Carroll et al., 1995; Richardson, 1996). It provides a natural way of
taking into account all sources of uncertainty in the estimation of the parameters. Computing
the posterior distributions involves high-dimensional numerical integration, but this can be
carried out straightforwardly by Gibbs sampling (Gelfand et al., 1990; Gelman et al., 1995).
Furthermore, the Bayesian approach of estimating the parameters of an IRT model ensures
that the model is identified without needing prior knowledge about the variances of the
measurement errors. It will be shown that the model is identified in a natural way by fixing
the latent ability scale.

This article consists of eight sections. After this introduction section, a general 
multilevel model will be presented, where some of the covariates are unobserved. In the 
next section, two measurement error models will be discussed. Then, a Markov Chain Monte 
Carlo (MCMC) estimation procedure will be described for estimating the parameters of a 
multilevel model with measurement error in covariates on both levels. In the following section, 
measurement error in correlated predictors will be discussed. Then, after a small simulation 
study, examples of the procedure will be given. Finally, the last section contains a
discussion and suggestions for further research.

The Structural Multilevel Model 

There is a growing interest in the problems associated with describing the relations
between variables of different aggregation levels, for example, in the field of educational and
social research. In school effectiveness research, interest is focused on the effects of school
variables on the educational achievement of the students. To evaluate school effectiveness,
information is needed on both the student level and the school level. The heterogeneity
in student and school characteristics requires a statistical model that takes the variation and
relationships at each of the levels into account. Multilevel models support these requirements.
A number of investigators have examined the issue of multilevel modeling of educational data
(Bryk & Raudenbush, 1992; De Leeuw & Kreft, 1986; Goldstein, 1995; Raudenbush, 1988;
Snijders & Bosker, 1999).

The hierarchical model that is commonly used in analyzing continuous outcomes 
is a two-level formulation in which Level 1 regression parameters are assumed multivariate 
normally distributed across Level 2 units. Suppose that students (Level 1), indexed $ij$
($i = 1, \ldots, n_j$), are nested within schools (Level 2), indexed $j$ ($j = 1, \ldots, J$).

In its general form, Level 1 of the two-level model consists of a regression model, for each
of the $J$ nesting Level 2 groups ($j = 1, \ldots, J$), in which the observations ($y_{ij}$, $i = 1, \ldots, n_j$) are
modeled as a function of $Q$ predictor variables $\Lambda_{1ij}, \ldots, \Lambda_{Qij}$, that is,

$$y_{ij} = \beta_{0j} + \beta_{1j}\Lambda_{1ij} + \cdots + \beta_{qj}\Lambda_{qij} + \cdots + \beta_{Qj}\Lambda_{Qij} + e_{ij}, \tag{1}$$

where $e_j$ is an $(n_j \times 1)$ vector of residuals that are assumed to be normally distributed with
mean $0$ and variance $\sigma^2 I_{n_j}$. The regression parameters are treated as outcomes in a Level 2
model given by

$$\beta_{qj} = \gamma_{q0} + \gamma_{q1}\Gamma_{1qj} + \cdots + \gamma_{qs}\Gamma_{sqj} + \cdots + \gamma_{qS}\Gamma_{Sqj} + u_{qj}, \quad \text{for } q = 0, \ldots, Q, \tag{2}$$

where the Level 2 error terms $u_{qj}$, $q = 0, \ldots, Q$, have a multivariate normal distribution with
mean zero and covariance matrix $T$, and $\gamma_{qs}$ and $\Gamma_{sqj}$ are Level 2 regression coefficients (fixed
effects) and predictor variables, respectively. Although the coefficients of all the predictors in
the Level 1 model could be treated as random, it can be desirable to restrain the variation in one 
or more of the regression parameters to zero. This is accomplished by reformulating the model 
as a mixed model (Raudenbush, 1988; Seltzer et al., 1996). This will be further explored below 
in the estimation procedure. 
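
As an illustration of the two-level structure in formulas (1) and (2), the following sketch simulates data with one Level 1 predictor and one Level 2 predictor. The function name, the parameter values, and the restriction to a random intercept and slope with a diagonal covariance matrix T are assumptions made for this example only; they are not taken from the analyses reported in this article.

```python
import random

def simulate_two_level(J=50, n_j=20, gamma=((1.0, 0.5), (0.8, -0.3)),
                       tau_sd=(0.4, 0.2), sigma=1.0, seed=1):
    """Simulate y_ij = beta_0j + beta_1j * lam_ij + e_ij, with
    beta_qj = gamma[q][0] + gamma[q][1] * Gam_j + u_qj for q = 0, 1.

    All numerical values are hypothetical. tau_sd holds the standard
    deviations of u_0j and u_1j, which are drawn independently here
    (a diagonal T is assumed for simplicity).
    """
    rng = random.Random(seed)
    data = []
    for j in range(J):
        Gam_j = rng.gauss(0.0, 1.0)  # observed Level 2 predictor
        # Level 2 model, formula (2)
        beta = [gamma[q][0] + gamma[q][1] * Gam_j + rng.gauss(0.0, tau_sd[q])
                for q in range(2)]
        for _ in range(n_j):
            lam = rng.gauss(0.0, 1.0)  # observed Level 1 predictor
            # Level 1 model, formula (1)
            y = beta[0] + beta[1] * lam + rng.gauss(0.0, sigma)
            data.append((j, Gam_j, lam, y))
    return data
```

Each returned tuple holds the school index, the Level 2 predictor, the Level 1 predictor and the outcome. Because both predictors have mean zero in this sketch, the grand mean of the simulated outcomes should lie close to the hypothetical intercept gamma[0][0].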

The explanatory variables at Level 1 comprise information on students’ characteristics,
such as gender or age. Level 1 explanatory variables can also be latent
variables, such as socio-economic status, intelligence, community loyalty, social
consciousness, managerial ability or willingness to adopt new practices. Explanatory variables
such as region, school funding or gender are directly observable, but latent variables are inherently
measured with error due to response variance. Below, an example will be given of an analysis
where students’ abilities regarding mathematics are predicted by scores obtained, on Level 1,
using an IQ test and, on Level 2, using an adaptive instruction test taken by teachers.
Both explanatory variables are measured with error due to response variance. In predicting
students’ abilities, an increase in precision (i.e., a reduction in $\sigma^2$) could be obtained by using
student pretest scores as a covariate in the Level 1 model, but errors in the predictor variables
cause bias in the estimated regression coefficients (Carroll et al., 1995, p. 22).
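
The bias caused by errors in a predictor can be made concrete with a small simulation. The sketch below is a deliberately simplified single-level illustration, not the multilevel procedure of this article: an outcome is regressed once on the true predictor and once on an error-contaminated surrogate. The function names, the slope of 2 and the unit error variance are assumptions chosen for the example.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y on x (single predictor)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def attenuation_demo(n=20000, beta=2.0, error_sd=1.0, seed=7):
    """Return (slope using the true predictor, slope using the surrogate).

    theta is the true predictor with variance 1; x = theta + error is the
    observed surrogate. All values are hypothetical.
    """
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [t + rng.gauss(0.0, error_sd) for t in theta]
    y = [beta * t + rng.gauss(0.0, 1.0) for t in theta]
    return ols_slope(theta, y), ols_slope(x, y)
```

With unit variance for both the true predictor and the error, the reliability of the surrogate is 0.5, so the slope based on the surrogate is expected to be close to half the slope based on the true predictor.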

Below, the unobserved Level 1 covariates are denoted by $\theta$, whereas the directly
observed covariates are denoted by $\Lambda$. Therefore, Level 1 of the structural model, formula
(1), is reformulated as

$$y_{ij} = \beta_{0j} + \beta_{1j}\theta_{1ij} + \cdots + \beta_{qj}\theta_{qij} + \beta_{(q+1)j}\Lambda_{(q+1)ij} + \cdots + \beta_{Qj}\Lambda_{Qij} + e_{ij}, \tag{3}$$

where the first $1, \ldots, q$ predictors correspond to unobservable variables and the remaining
$q+1, \ldots, Q$ predictors correspond to directly observable variables. The Level 2 model, formula
(2), containing predictors with measurement error, $\zeta$, and directly observed covariates, $\Gamma$, is
reformulated as

$$\beta_{qj} = \gamma_{q0} + \gamma_{q1}\zeta_{1qj} + \cdots + \gamma_{qs}\zeta_{sqj} + \gamma_{q(s+1)}\Gamma_{(s+1)qj} + \cdots + \gamma_{qS}\Gamma_{Sqj} + u_{qj}, \tag{4}$$

for $q = 0, \ldots, Q$, where the first $1, \ldots, s$ predictors correspond to unobservable variables and
the remaining $s+1, \ldots, S$ correspond to known fixed constants. The set of variables $\theta$ is never
observable, but supplementary information about $\theta$, denoted as $X$, is known. In this case, $X$ is
said to be a surrogate, that is, $X$ has no information about $Y$ other than what is available in $\theta$.
This is characteristic of nondifferential measurement error (Carroll et al., 1995, pp. 16-17).
On Level 2, $W$ is defined as a surrogate for $\zeta$. Nondifferential measurement error is important
because parameters in models for responses can be estimated given the true covariates even
when the true covariates $(\theta, \zeta)$ are not observable, as will be shown below.

Suppose that on Level 1 and 2, formulas (3) and (4), only unobserved predictor
variables are available, with all regression parameters on Level 1 varying across Level 2 groups.
Then the relationship between $Y_{ij}$ and $(X_{ij}, W_j)$ can be expressed as

$$\begin{aligned}
E(Y_{ij} \mid X_{ij}, W_j) &= E\left[E(Y_{ij} \mid \theta_{ij}, \zeta_j, X_{ij}, W_j) \mid X_{ij}, W_j\right] \\
&= E\left[E(Y_{ij} \mid \theta_{ij}, \zeta_j) \mid X_{ij}, W_j\right] \\
&= E\left[\theta_{ij}\zeta_j\gamma \mid X_{ij}, W_j\right] \\
&= E\left[\theta_{ij} \mid X_{ij}\right]\left(E\left[\zeta_j \mid W_j\right]\gamma\right).
\end{aligned}$$

The second equality above is justified by the assumption of nondifferential measurement error.
The third and fourth equalities follow from the substitution of formula (4) in (3) with no
directly observable variables and determining the conditional expectation of $Y_{ij}$ given $(\theta_{ij}, \zeta_j)$.
Obviously, unless proper adjustments are made, statistical inference can be very misleading
because of the product of measurement errors. That is, without appropriate methods for
correcting for the effects of measurement error, the effects can range from biased parameter
estimates to situations where real effects are hidden and signs of the estimated coefficients are
reversed relative to the case with no measurement error (Carroll et al., 1995, pp. 21-23).

Measurement Error Models 

It will be shown that all parameters in the model can be estimated on account of 
the assumption of nondifferential measurement error, but first the relationship between the
surrogate and the unobserved covariate is discussed. In this section, attention is focused on two
parametric models for the response: the well-known classical true score model and the normal 
ogive model. 



The Classical True Score Model 

In psychological and educational measurement, the researcher attempts to measure an 
unobservable characteristic with a test. This test is administered to a person repeatedly, where 
the individual is assumed to remain unchanged throughout the process. The individual’s score 
on a particular test form, the observed score, is considered to be a chance variable with some, 
usually unknown, frequency distribution. The mean (expected value) of this distribution, that is, 
the average score that the person would obtain on infinitely many independent repeated trials is 
interpreted as the true score. The error of measurement is the discrepancy between the observed 
scores and the true score. Since, by definition, the expected value of the observed scores is the 
true score, the expectation of the errors of measurement or error scores is zero. It is assumed 
that the corresponding true scores and error scores are uncorrelated and that error scores on 
different measurements are also uncorrelated. Denote $X_{ijk}$ as the measurement associated with
individual $ij$, let $\theta_{ij}$ be the mean of the response distribution and let $\varepsilon_{ijk}$ be the sampling deviation
for the $k$-th response obtained from the $ij$-th individual’s response distribution, that is,

$$X_{ijk} = \theta_{ij} + \varepsilon_{ijk}. \tag{5}$$

This is the classical true score model (see, for example, Lord & Novick, 1968). The true score
$\theta_{ij}$ of a person indexed $ij$ is defined as the expected value of the observed score where the
expectation is taken with respect to the response distribution. This response distribution is 
hypothetical because in psychology and other subject areas it is usually not possible to obtain 
more than a few independent observations. This model coincides mathematically with the 
classical additive measurement error model (Fuller, 1987, equation 1.1.2), where a normal 
distribution of the error variable is assumed. 

It is not strictly necessary to assume that the response distribution variances are equal
for different persons. This means that it is possible to measure some persons’ responses more
accurately than others. But error variances for individual examinees are usually subject to large
sampling fluctuations. In the sequel, the group specific error variance, denoted as $\varphi$, is used as
an estimate of the individual error variances, where the group consists of the total number of
examinees. This group specific error variance is the variance over the examinees of the errors
of measurement, which is equal to the specific error variance averaged over the total number
of examinees (Lord & Novick, 1968, p. 155). The group specific error variance is used as an
approximation to the individual error variances of which it is the average.
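
The role of the group specific error variance can be illustrated with simulated replications from the classical true score model. In the sketch below, each person receives a true score and a small number of independent repeated measurements, and the group specific error variance is estimated as the average of the within-person sample variances. The function names, the number of replications and the error variance of 0.5 are assumptions for illustration; in practice independent replications are rarely available, as noted above.

```python
import random
from statistics import mean, variance

def simulate_true_score(n_persons=2000, n_rep=5, error_var=0.5, seed=3):
    """Simulate observed scores as true score plus normal error.

    Returns one list of n_rep replicated observed scores per person.
    True scores are drawn from a standard normal distribution and a common
    error variance is used for every person (hypothetical values).
    """
    rng = random.Random(seed)
    error_sd = error_var ** 0.5
    scores = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # true score of this person
        scores.append([theta + rng.gauss(0.0, error_sd) for _ in range(n_rep)])
    return scores

def group_error_variance(scores):
    """Average of the within-person sample variances of the replications."""
    return mean(variance(row) for row in scores)
```

An individual sample variance based on five replications is very unstable, whereas the average over all persons recovers the common error variance closely; this is the reason for using the group specific value as an approximation.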

The Normal Ogive Model 

Item response models are item-based. In case of dichotomous items, the item response 
function (traceline, item characteristic curve) is the probability of a correct response to an item 
as a function of ability. In this section, the normal ogive model is considered as a measurement 
error model (see Lord, 1980, pp. 27-41 for a complete description of the normal ogive model). 
Accordingly, the probability of a correct response of a person indexed $ij$ on an item indexed $k$
($k = 1, \ldots, K$), $X_{ijk} = 1$, is given by

$$P(X_{ijk} = 1 \mid \theta_{ij}, a_k, b_k) = \Phi(a_k\theta_{ij} - b_k), \tag{6}$$

where $\Phi$ denotes the standard normal cumulative distribution function, and $a_k$ and $b_k$ are the
discrimination and difficulty parameter of item $k$, respectively. Below, the parameters of item $k$
will also be denoted by $\xi_k = (a_k, b_k)$. An IRT model provides the frequency distribution of test
scores for an examinee indexed $ij$ having a specified level $\theta_{ij}$ of ability or skill. The variance,
$\sigma^2_{X_{ij} \mid \theta_{ij}}$, of this conditional distribution of the number-right score $X_{ij}$ is

$$\begin{aligned}
\sigma^2_{X_{ij} \mid \theta_{ij}} &= \sum_{k=1}^{K} P(X_{ijk} = 1 \mid \theta_{ij}, \xi_k)\left[1 - P(X_{ijk} = 1 \mid \theta_{ij}, \xi_k)\right] \\
&= \sum_{k=1}^{K} \Phi(a_k\theta_{ij} - b_k)\,\Phi(b_k - a_k\theta_{ij}). 
\end{aligned} \tag{7}$$

Notice that this implies response variance given $\theta$. The posterior distribution of $\theta_{ij}$ given $x_{ij}$,
$p(\theta_{ij} \mid x_{ij})$, is proportional to the distribution of $x_{ij}$ given the ability level $\theta_{ij}$, $p(x_{ij} \mid \theta_{ij})$,
multiplied by the standard normal density. Therefore, the posterior variance of $p(\theta_{ij} \mid x_{ij})$,
or local reliability, $\sigma^2_{\theta_{ij} \mid x_{ij}}$, is closely related to response variance, and it follows that
this results in the possibility of heteroscedasticity. Furthermore, the measurement scale is
independent of the items in the test. This is in contrast to classical test theory, where the true
score depends on the items in the test and homoscedasticity is assumed.
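
The quantities in formulas (6) and (7), and the dependence of the posterior variance on the response pattern, can be sketched numerically. The code below evaluates the normal ogive response function and the conditional variance of the number-right score, and approximates the posterior variance of the ability under a standard normal prior by summation over a grid. The function names, the grid settings and the choice of ten identical items are assumptions made for this illustration.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def p_correct(theta, a, b):
    """Formula (6): probability of a correct response, Phi(a * theta - b)."""
    return _STD.cdf(a * theta - b)

def number_right_variance(theta, items):
    """Formula (7): variance of the number-right score given theta.

    items is a list of (a_k, b_k) pairs.
    """
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += p * (1.0 - p)
    return total

def posterior_variance(pattern, items, lim=6.0, step=0.01):
    """Grid approximation of the posterior variance of theta given a 0/1
    response pattern, using a standard normal prior (illustrative only)."""
    n = int(round(2 * lim / step)) + 1
    grid = [-lim + i * step for i in range(n)]
    weights = []
    for t in grid:
        w = _STD.pdf(t)  # prior density
        for x, (a, b) in zip(pattern, items):
            p = p_correct(t, a, b)
            w *= p if x == 1 else 1.0 - p  # likelihood contribution of one item
        weights.append(w)
    total = sum(weights)
    post_mean = sum(w * t for w, t in zip(weights, grid)) / total
    return sum(w * (t - post_mean) ** 2 for w, t in zip(weights, grid)) / total
```

For ten items with $a_k = 1$ and $b_k = 0$, a call such as `posterior_variance([1] * 10, items)` can be compared with `posterior_variance([1] * 5 + [0] * 5, items)`; the all-correct pattern gives the larger posterior variance, which is the heteroscedasticity referred to above.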

An MCMC Estimation Procedure for a Multilevel Model with Measurement Error

Bayesian analysis of parametric models requires the specification of a likelihood and 
prior. Often non-informative priors are used. The posterior distribution, which is derived 
from the joint density of the data and parameters according to Bayes formula, summarizes 
all of the information about the values of the parameters. Interest is focused on the expected 
a posteriori values of the parameters and posterior standard errors. In principle, complex 
models, such as the proposed multilevel model with measurement error in the covariates, 
demand sophisticated numerical analytical methods to obtain estimates of the parameters of 
interest. However, Markov Chain Monte Carlo (MCMC) algorithms have shown great potential
for estimating complex models, and currently the Gibbs sampler (Geman & Geman, 1984) is
receiving much attention in the literature (see, e.g., Bernardo & Smith, 1994; Gelfand & Smith,
1990; Robert & Casella, 1999). Gibbs sampling succeeds because it reduces the problem of
dealing simultaneously with missing data and a large number of related unknown parameters
into a much simpler problem of dealing with one unknown quantity at a time by sampling each 
from its full conditional distribution. This sampling-based method is conceptually simple and 
easily implemented. In a proper setting, the Gibbs sampler generates a Markov chain which 
converges in distribution to the joint posterior distribution of the parameters of interest (Tierney, 
1994). That is, a Markov chain is constructed in such a way that its stationary distribution, also
called the limiting distribution, is the joint posterior distribution of the model parameters. The
chain can be simulated using only the full conditionals of the parameters, that is, these are the 
only densities used for simulation. 
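
The principle of simulating a chain from full conditionals only can be shown on a toy problem. The sketch below is not the sampler for the multilevel model; it is a Gibbs sampler for a standard bivariate normal distribution with correlation rho, whose full conditionals are the univariate normal distributions N(rho * y, 1 - rho^2) and N(rho * x, 1 - rho^2). The function name, the number of iterations and the burn-in length are assumptions for illustration.

```python
import random

def gibbs_bivariate_normal(rho=0.6, n_iter=20000, burn_in=1000, seed=11):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each iteration draws x given y and then y given x from their
    full conditional distributions. Draws after the burn-in are returned.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5
    x = y = 0.0  # arbitrary starting values
    draws = []
    for t in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # full conditional of x given y
        y = rng.gauss(rho * x, cond_sd)  # full conditional of y given x
        if t >= burn_in:
            draws.append((x, y))
    return draws
```

Although only univariate conditional draws are used, the sample correlation of the retained pairs approximates rho, which illustrates that the stationary distribution of the chain is the joint distribution.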

First, the implementation of the Gibbs sampler is considered in case of a multilevel 
model with a normal ogive model as measurement model for the predictor variables. In 
this implementation it is assumed that all predictor variables are uncorrelated. Second, the 
implementation of the Gibbs sampler is described with the classical true score model as 
measurement model. Correlated predictors with measurement error will be discussed in the 
next section. 

Estimation using Gibbs Sampling 

Evaluation of the model for the observed data is complicated by the fact that some
elements are missing. Here, as is usual in a Bayesian analysis, the unobserved $\theta$’s and $\zeta$’s
are treated as unobserved random parameters. Let $\theta_{ij}$ be the first $q$ explanatory variables on
Level 1 which are latent, as in formula (3). The set of explanatory variables on Level 1 for
predicting $Y_{ij}$ is defined as $\Omega_{ij} = (\theta_{ij}, \Lambda_{ij})$, where $\Lambda_{ij}$ consists of the remaining $q+1, \ldots, Q$
observable covariates on Level 1 without measurement error. Further, let $\zeta_{qj}$ be the first $s$
latent explanatory variables in predicting $\beta_{qj}$ on Level 2, as in formula (4). To complete the
description of the covariates on Level 2, let $\Psi_{qj} = (\zeta_{qj}, \Gamma_{qj})$ represent the set of explanatory
variables for $\beta_{qj}$, where $\Gamma_{qj}$ are the remaining $s+1, \ldots, S$ directly observable variables, also
according to formula (4).

The MCMC algorithm is straightforwardly implemented with the introduction of the 
continuous latent variable that underlies each binary response. This approach follows the 
procedure of Albert (1992), which builds on the Data Augmentation algorithm of Tanner 
and Wong (1987), and has been extensively used in other missing data problems (see, for 
example, Beguin, 2000; Fox & Glas, 2000; Johnson & Albert, 1999, pp. 194-202; Maris, 
1995; Robert & Casella, 1999, pp. 414-438). Assume that the latent variables $\theta_{qij}$ are related
to the observed responses, $X_{qijk}$, of a person, indexed $ij$, on an item, indexed $k$ ($k = 1, \ldots, K$).
This observation $X_{qijk}$ can be interpreted as an indicator that a continuous variable with normal
density is below or above 0. Denote this continuous variable as $Z^{(x)}_{qijk}$, where the superscript $x$
denotes the connection with the observed response variable $X_{qijk}$. It is assumed that $X_{qijk} = 1$
if $Z^{(x)}_{qijk} > 0$ and $X_{qijk} = 0$ otherwise. It follows that

$$p(z^{(x)}_{qijk} \mid \theta_{qij}, \xi_k, x_{qijk}) \propto f(z^{(x)}_{qijk}; a_k\theta_{qij} - b_k, 1)\left[I(z^{(x)}_{qijk} > 0)\,I(x_{qijk} = 1) + I(z^{(x)}_{qijk} \le 0)\,I(x_{qijk} = 0)\right],$$

where $f(\cdot\,; a_k\theta_{qij} - b_k, 1)$ stands for the normal density with mean equal to $a_k\theta_{qij} - b_k$ and
variance equal to one, and $I(\cdot)$ is an indicator variable taking the value one if its argument
is true, and taking the value zero otherwise. Further, $\theta_{qij}$ and $\xi_k$ are the person and
item parameters for person $ij$ and item $k$, respectively. The matrix $Z^{(x)}$ serves to simplify
calculations and the value of $Z^{(x)}$ does not affect the value of the estimator, that is, $Z^{(x)}$ is only
a useful device. Let $W_{sqjk}$ be a dichotomous response variable of a Level 2 unit, indexed $j$,
on an item, indexed $k$, related to the $s$-th Level 2 latent variable, $\zeta_{sqj}$, for predicting $\beta_{qj}$. For
example, $\zeta_{sqj}$ might be the pedagogical climate of school $j$ measured using a questionnaire
with dichotomously scored questions administered to a teacher or principal of school $j$. In the
same way as for Level 1, complete data are formed and the augmented data will be denoted
by $Z^{(w)}_{sqjk}$.
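
A draw of one augmented value from the truncated normal distribution above can be sketched with the inverse cumulative distribution method. The function name and argument order are assumptions for this illustration, and the published implementations cited above may use a different sampling scheme. The inverse-CDF approach is simple but numerically fragile when the mean $a_k\theta - b_k$ is far from zero in the direction opposite to the observed response, so the sketch is intended for moderate parameter values only.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def draw_augmented_z(x, theta, a, b, rng):
    """Draw Z from N(a * theta - b, 1), truncated to Z > 0 if x == 1
    and to Z <= 0 if x == 0, by inverting the normal CDF."""
    mu = a * theta - b
    p_nonpos = _STD.cdf(-mu)  # probability that the untruncated Z is <= 0
    u = rng.random()
    if x == 1:
        q = p_nonpos + u * (1.0 - p_nonpos)  # uniform on (P(Z <= 0), 1)
    else:
        q = u * p_nonpos                     # uniform on (0, P(Z <= 0))
    # keep the argument of inv_cdf strictly inside (0, 1)
    q = min(max(q, 1e-12), 1.0 - 1e-12)
    return mu + _STD.inv_cdf(q)
```

For example, with `rng = random.Random(5)`, repeated calls `draw_augmented_z(1, 0.0, 1.0, 0.0, rng)` produce positive values only, and their average approaches the mean of a half-normal distribution.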

Unlike the fully conditional distributions of the parameters, the full posterior
distribution has an intractable form and is very difficult to simulate. On the other hand,
it will be shown below that the fully conditional distributions of the parameters are each
tractable and easy to simulate. The Gibbs sampler consists of sampling each of the
parameters conditionally on all other parameters in a number of steps. Instead of showing
all steps in detail, references will be given for those steps which are well known and need
no further explanation. The total procedure consists of stepwise drawing from the
conditional posterior distributions of the components $Z^{(x)}$, $\xi^{(x)}$, $\theta$, $\beta$, $\sigma^2$, $\gamma$, $T$, $Z^{(w)}$, $\xi^{(w)}$, and
$\zeta$. The procedure consists of 10 steps:

(1) Draw $Z^{(x)}$ conditional on $\theta$, $\xi^{(x)}$ and $X$.

(2) Draw $\xi^{(x)}$ conditional on $\theta$ and $Z^{(x)}$.

(3) Draw $\theta$ conditional on $Z^{(x)}$, $\xi^{(x)}$, $\beta$, $\sigma^2$, $\Omega$, and $Y$.

(4) Draw $\beta$ conditional on $\Omega$, $\Psi$, $\sigma^2$, $\gamma$, $T$ and $Y$.

(5) Draw $\gamma$ conditional on $\beta$, $\Psi$ and $T$.

(6) Draw $\sigma^2$ conditional on $\beta$, $\Omega$ and $Y$.

(7) Draw $T$ conditional on $\beta$, $\Psi$ and $\gamma$.

(8) Draw $Z^{(w)}$ conditional on $\zeta$, $\xi^{(w)}$ and $W$.

(9) Draw $\xi^{(w)}$ conditional on $\zeta$ and $Z^{(w)}$.

(10) Draw $\zeta$ conditional on $Z^{(w)}$, $\xi^{(w)}$, $\beta$, $\Psi$ and $\gamma$.

Sampling augmented data, $Z^{(x)}$, and sampling the item parameters, $\xi^{(x)}$, is described
by Albert (1992) and Fox and Glas (2000). The third step, sampling $\theta$, deserves a more detailed
description.

Step 3 The $q$ latent predictor variables, $\theta_{1ij}, \ldots, \theta_{qij}$, can be sampled individually because it
is assumed that they are uncorrelated. The ability parameters given augmented data $Z^{(x)}_{qij}$ and
parameters $\xi^{(x)}$, $\beta_j$ and $\sigma^2$ are independent and distributed as a mixture of normal distributions
in relation to the latent variable $\theta_{qij}$. That is, the augmented data $Z^{(x)}_{qij}$ and the observed data
$Y_{ij}$ are normally distributed with, among others, parameter $\theta_{qij}$, which is a priori normally
distributed. The two-parameter normal ogive model must be identified by fixing the origin and
scale of the latent dimension. Therefore, the mean and variance of the ability distribution are
fixed to zero and one, which avoids over-parametrization. According to formula (3), the
definition of the augmented data and the prior for $\theta_{qij}$, it follows that

$$p(\theta_{qij} \mid z^{(x)}_{qij}, \xi^{(x)}, \beta_j, \sigma^2, y_{ij}) \propto p(z^{(x)}_{qij} \mid \theta_{qij}, \xi^{(x)})\, p(y_{ij} \mid \theta_{qij}, \Omega^{-}_{ij}, \beta_j, \sigma^2)\, p(\theta_{qij}), \tag{8}$$

where $\Omega^{-}_{ij}$ is the set of explanatory variables for a person, indexed $ij$, on Level 1 without $\theta_{qij}$.
Split the regression coefficients on Level 1, $\beta_j$, into $\beta_{qj}$ and $\beta^{-}_{j}$ to distinguish the regression
coefficient of explanatory variable $\theta_{qij}$ from the regression coefficients of the other explanatory
variables $\Omega^{-}_{ij}$, respectively. Formula (8) is the product of a normal model for the regression
of $Z^{(x)}_{qijk} + b_k$ on $a_k$ with $\theta_{qij}$ as a regression coefficient, a normal model for the regression of
$Y_{ij} - \beta^{-}_{j}\Omega^{-}_{ij}$ on $\beta_{qj}$ with $\theta_{qij}$ as a regression coefficient, and a standard normal prior for $\theta_{qij}$.
Due to standard properties of normal distributions (see, e.g., Box & Tiao, 1973; Lindley &
Smith, 1972), the fully conditional posterior density of $\theta_{qij}$ is again normal and
given by

$$\theta_{qij} \mid Z^{(x)}_{qij}, \xi^{(x)}, \beta_j, \sigma^2, y_{ij} \sim N\left(\frac{\hat\theta_{qij}/v + \tilde\theta_{qij}/\varphi}{1/v + 1/\varphi + 1},\; \frac{1}{1/v + 1/\varphi + 1}\right), \tag{9}$$

with $\hat\theta_{qij} = \left(\sum_{k=1}^{K} a_k^2\right)^{-1}\sum_{k=1}^{K} a_k\left(z^{(x)}_{qijk} + b_k\right)$ and $\tilde\theta_{qij} = \beta_{qj}^{-1}\left(y_{ij} - \beta^{-}_{j}\Omega^{-}_{ij}\right)$; the
variances are $v = \left(\sum_{k=1}^{K} a_k^2\right)^{-1}$ and $\varphi = \beta_{qj}^{-2}\sigma^2$. Notice that the posterior expectation in formula
(9) is the well-known composite or shrinkage estimator. The estimate of $\theta_{qij}$ is a combination
of two estimates, $\hat\theta_{qij}$ and $\tilde\theta_{qij}$, where the amount of weight placed on each estimate depends on
its precision. Notice that the standard normal prior for $\theta_{qij}$ adds
a term 1 to the reciprocal of the posterior variance; because its mean is zero, it adds no term
to the numerator of the posterior expectation.
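
The draw in Step 3 can be sketched for a single latent predictor as follows. The function names and argument layout are assumptions for this illustration: `z`, `a` and `b` are lists over the items, `y_resid` stands for the outcome minus the contribution of all other Level 1 predictors, and `beta_q` is the (nonzero) regression coefficient of the latent predictor. The computation follows formula (9) as written above and is not taken from the authors' software.

```python
import random

def theta_full_conditional(z, a, b, y_resid, beta_q, sigma2):
    """Mean and variance of the normal full conditional in formula (9).

    z: augmented data for the K items; a, b: item parameters;
    y_resid: outcome minus the part explained by the other predictors;
    beta_q: coefficient of the latent predictor (must be nonzero);
    sigma2: Level 1 residual variance.
    """
    v = 1.0 / sum(ak * ak for ak in a)  # variance of the item-based estimate
    theta_hat = v * sum(ak * (zk + bk) for ak, zk, bk in zip(a, z, b))
    phi = sigma2 / beta_q ** 2          # variance of the outcome-based estimate
    theta_tilde = y_resid / beta_q
    precision = 1.0 / v + 1.0 / phi + 1.0  # the final 1 comes from the N(0, 1) prior
    post_mean = (theta_hat / v + theta_tilde / phi) / precision
    return post_mean, 1.0 / precision

def draw_theta(z, a, b, y_resid, beta_q, sigma2, rng):
    """Draw theta from its full conditional distribution."""
    post_mean, post_var = theta_full_conditional(z, a, b, y_resid, beta_q, sigma2)
    return rng.gauss(post_mean, post_var ** 0.5)
```

As a check by hand, two items with $a_k = 1$, $b_k = 0$ and $z_k = 1$, combined with `y_resid = 2`, `beta_q = 1` and `sigma2 = 1`, give $v = 0.5$, $\hat\theta = 1$, $\tilde\theta = 2$ and a posterior precision of 4, so the full conditional has mean 1 and variance 0.25.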

The modification of the multilevel model to handle measurement error in the
covariates causes minimal change in the complete conditional distributions of the parameters
of the multilevel model, $(\beta, \gamma, \sigma^2, T)$, computed in Steps 4-7. The full conditionals of the
multilevel model parameters, necessary for the estimation procedure, can be found in Fox and
Glas (2000) and Seltzer (1993, 1996).

Measurement error in the predictor variables on Level 2 is treated in the same way
as on Level 1, with a normal ogive model as measurement model. Therefore, augmented
data, denoted as $Z^{(w)}$, in relation to the observed data $W$, item parameters $\xi^{(w)}$ and $\zeta$ have
to be sampled. An adapted complete conditional of $Z^{(w)}$ given $\zeta$ and $\xi^{(w)}$ can be found in Albert
(1992) and Fox and Glas (2000). An adapted complete conditional distribution of the item
parameters can also be found therein. This covers Steps 8 and 9.

Step 10 Split the regression coefficients 7 9 on Level 2 in j qs and 7q*\ relating to the predictor 
( sq] and remaining Level 2 covariates \f r “ , respectively, where is the set of explanatory 
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variables for p qj on Level 2 without ( sqj . Notice that the latent predictor variables ( lqj , ( sqj 
can be sampled individually, because it is assumed that they are independent. Here, the Level 
2 model, formula (4), is reformulated as, 

Pq j ~ = IqsCsqj + U QV ( 10 ) 



where u q j ~ N (0, r qq ) and r 2 q is the q th diagonal element of T. From formula (10) follows 
the least squares estimator ( sqj = 7" 1 (^/3 qj — Tf^&qj) . The parameters ( sqj given augmented 
data and parameters ^ w \P qj , ’® r " and 7 , are independent and distributed as a mixture 
of normal distributions. That is, augmented data, and regression coefficient, 3 q j, are 
normally distributed with, among others, parameter ( sq j which is a priori normally distributed. 
Therefore, it follows that 

p((sqj I Z wj>& W) >Pqj>' if qi>7q) K P ( Z i,j I Cs«» P (Pqj I Csqj> 7,) P(<sqj)- C 1 0 



For identification of the model, the prior for $\zeta_{sqj}$ is the standard normal distribution. Hence, the 
fully conditional posterior density of $\zeta_{sqj}$ is given by 

$$\zeta_{sqj} \mid Z^{(w)}_{sqj}, \xi^{(w)}, \beta_{qj}, \Gamma_{qj}^{(-s)}, \gamma_q \sim N\left( \frac{\kappa^{-1} \hat{\zeta}_{sqj} + \psi^{-1} \tilde{\zeta}_{sqj}}{\kappa^{-1} + \psi^{-1} + 1}, \; \frac{1}{\kappa^{-1} + \psi^{-1} + 1} \right), \tag{12}$$

where $\hat{\zeta}_{sqj}$ is the least squares estimator following from the regression of $Z^{(w)}_{sqjk} + b'_k$ on $a'_k$ and 
$\kappa$ the variance of $\hat{\zeta}_{sqj}$, as in Step 3. The item parameters $\xi^{(w)}_k = (a'_k, b'_k)$ are sampled in Step 9. 
Finally, $\tilde{\zeta}_{sqj}$ is the least squares estimator for $\zeta_{sqj}$, formula (10), with variance $\psi = \tau_q^2 / \gamma_{qs}^2$. 
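
The draw in Step 10 combines three sources of information by adding their precisions: the item-based estimator, the regression-based estimator, and the standard normal prior. A minimal sketch of that calculation follows; the function and argument names are hypothetical and the numbers in the usage lines are made up for illustration.

```python
import random


def posterior_moments(zeta_hat, kappa, zeta_tilde, psi):
    """Mean and variance of the normal full conditional in formula (12).

    zeta_hat / kappa: item-based least squares estimate and its variance.
    zeta_tilde / psi: regression-based least squares estimate and its variance.
    The trailing 1.0 is the precision of the standard normal prior.
    """
    precision = 1.0 / kappa + 1.0 / psi + 1.0
    mean = (zeta_hat / kappa + zeta_tilde / psi) / precision
    return mean, 1.0 / precision


def draw_latent_predictor(zeta_hat, kappa, zeta_tilde, psi, rng):
    """One Gibbs draw of the latent Level 2 predictor."""
    mean, variance = posterior_moments(zeta_hat, kappa, zeta_tilde, psi)
    return rng.gauss(mean, variance ** 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    print(posterior_moments(1.0, 0.5, 2.0, 0.25))
    print(draw_latent_predictor(1.0, 0.5, 2.0, 0.25, rng))
```

Because the prior contributes a precision of one, the posterior mean is always pulled toward zero relative to a plain weighted average of the two estimators.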



This implementation of the Gibbs sampler is easily changed into an estimation 
procedure for estimating the parameters of the structural (multilevel) model with the classical 
true score model as measurement error model. It is assumed that the variance structure, $\varphi$, is 
known and given by formula (5). This is also necessary for identification of the model. The 
surrogates $X$ and $W$ provide a sum-score or observed score $X_{ij}$ of the examinee indexed $ij$ on 
Level 1 and a sum-score, $W_j$, observed in school $j$. Thus, in this case the classical true score 
model, instead of the normal ogive model, is used as measurement error model on Level 1 and 
Level 2. It is easily seen that Steps 1, 2, 8, and 9 can be left out. Steps 3 and 10 change 
into the following two steps. 

Step 3' Let $X_{qij}$ denote the observed score of a person, indexed $ij$, in relation to $\theta_{qij}$, the 
$q$th latent covariate on Level 1 in predicting $Y_{ij}$. Again, the latent predictors on Level 1 can 
be sampled separately because it is assumed that they are independent. Further, $X_{qij}$ is a 
random variable taking on values from independent repeated measurements, which is normally 
distributed with mean $\theta_{qij}$ and variance $\varphi$. The complete conditional of $\theta_{qij}$ follows from the 
regression of $X_{qij}$ on $\theta_{qij}$ and the regression of $Y_{ij}$ on $\theta_{qij}$, formula (3). It follows that 

$$p(\theta_{qij} \mid \omega_{ij}, \beta_j, \sigma^2, \varphi, x_{qij}, y_{ij}) \propto p(x_{qij} \mid \theta_{qij}, \varphi) \, p(y_{ij} \mid \theta_{qij}, \omega_{ij}, \beta_j, \sigma^2).$$

The prior information for $\theta_{qij}$ is incorporated into the measurement error model, where the 
distribution and variance structure of the true score is determined. It follows that the fully 
conditional posterior density of $\theta_{qij}$ is given by 



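$$\theta_{qij} \mid \omega_{ij}, \beta_j, \sigma^2, \varphi, X_{qij}, Y_{ij} \sim N\left( \frac{x_{qij}/\varphi + \tilde{\theta}_{qij}/\phi}{1/\varphi + 1/\phi}, \; \frac{1}{1/\varphi + 1/\phi} \right), \tag{13}$$

with $\tilde{\theta}_{qij}$ and $\phi$ as in formula (9). 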
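
Formula (13) can be read as a weighted average of the observed score and the regression-based estimator. The rewriting below is a sketch that follows from multiplying the numerator and denominator by $\varphi\phi$; the weight $w$ is a symbol introduced here for illustration and does not appear in the article.

```latex
\begin{align*}
E(\theta_{qij} \mid \cdot)
  &= \frac{x_{qij}/\varphi + \tilde{\theta}_{qij}/\phi}{1/\varphi + 1/\phi}
   = w \, x_{qij} + (1 - w) \, \tilde{\theta}_{qij},
   \qquad w = \frac{\phi}{\varphi + \phi}, \\
\operatorname{Var}(\theta_{qij} \mid \cdot)
  &= \frac{1}{1/\varphi + 1/\phi}
   = \frac{\varphi \, \phi}{\varphi + \phi}.
\end{align*}
```

As the error variance $\varphi$ approaches zero, $w$ approaches one and the draw collapses onto the observed score, which corresponds to treating the predictor as measured without error.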

The classical true score model can also be used for modeling the measurement error in 
the predictor variables on Level 2. Let $\zeta_{sqj}$ be the expected value of the observed score, $W_{sqj}$, 
where the expectation is taken with respect to the normal distribution, the assumed response 
distribution. Further, define $\kappa$ as the variance, a priori known, over parallel observations of 
$W_{sqj}$. It follows that $\zeta_{sqj}$ can be sampled in the same way as in Step 3'. That is, in Step 10', draw 
$\zeta_{sqj}$ conditional on $W_{sqj}$, $\kappa$, $\beta_{qj}$, $\Gamma_{qj}^{(-s)}$, and $\gamma_q$. 



In formula (3) it is assumed that every regression coefficient varies across Level 2 
groups. In certain applications, it can be desirable to constrain the effect of one or more of 
the Level 1 predictors to be identical across Level 2 units. An implementation of the Gibbs 
sampler, where regression coefficients are treated as non-varying across Level 2 groups, needs 
a further division of regression components. This calls for a division in regression coefficients 
related to observed predictors and latent predictors, with a further subdivision of both parts into 
components treated as random and components treated as non-random across Level 2 groups. 
Finally, the complete conditional distribution of each subset, given the other parameters and 
the data, must be specified (see, for example, Seltzer et al., 1996). 

The presented 10 steps define the Gibbs sampler for estimation of the parameters of 
the multilevel model with measurement error in the predictor variables, where the normal ogive 
model or the classical true score model is used as measurement error model. With initial 
values for the parameters, the Gibbs sampler repeatedly samples from the full conditional 
distributions with systematic scan, that is, the sampler updates the components in the natural 
ordering. A different strategy in updating the components can affect the speed of convergence 
(Roberts & Sahu, 1997). The initial parameter values also affect the rate of 
convergence. Initial estimates can be obtained by first running the MCMC procedure of Albert 
(1992) for the normal ogive model, with Bilog-MG estimates of the item parameters 
(Zimowski et al., 1996) as starting points. The means of the sampled values of the parameters 
of the normal ogive model are then used to sample the parameters of the multilevel model. After 
convergence, the means of the sampled values are used as initial estimates. 
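
A systematic scan can be illustrated on a toy target. The sketch below assumes a bivariate standard normal distribution with correlation `rho` rather than the multilevel model itself; it only shows the pattern of updating each component in a fixed order from its full conditional. The function name is hypothetical.

```python
import random


def gibbs_bivariate_normal(rho, n_iter, rng, start=(0.0, 0.0)):
    """Systematic-scan Gibbs sampler for a standard bivariate normal.

    Full conditionals: x | y ~ N(rho * y, 1 - rho^2) and
    y | x ~ N(rho * x, 1 - rho^2). Components are updated in the
    natural order (x first, then y) in every iteration.
    """
    x, y = start
    cond_sd = (1.0 - rho * rho) ** 0.5
    draws = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        draws.append((x, y))
    return draws


if __name__ == "__main__":
    chain = gibbs_bivariate_normal(0.5, 20000, random.Random(0), start=(5.0, 5.0))
    kept = chain[500:]  # discard early draws that still depend on the start
    print(sum(x for x, _ in kept) / len(kept))
```

Starting far from the mode, as in the usage lines, shows why poor initial values lengthen the period that has to be discarded.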

Convergence can be evaluated by comparing the between and within variance of 
multiple Markov chains generated from different starting points (see, for instance, Robert & 
Casella, 1999, p. 366). Another method is to generate a single Markov chain and to evaluate 
convergence by dividing the chain into subchains and comparing the between- and 
within-subchain variance. A single run is less wasteful in the number of iterations needed. Besides, 
when the rate of convergence is slow, a single long chain is more likely to get closer to the stationary 
distribution than several shorter chains. In the example given below, the full Gibbs sample was 
used in estimating all parameters instead of subsampling from this sample. The latter procedure 
leads to losses in efficiency (MacEachern & Berliner, 1994). Finally, after the Gibbs sampler 
has reached convergence and ‘enough’ samples are drawn, posterior means of all parameters 
of interest are estimated with the mixture estimator to reduce the sampling error attributable to 
the Gibbs sampler (Liu et al., 1994). The posterior standard deviations and credibility intervals 
can be estimated from the sampled values obtained from the Gibbs sampler. 
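
One common way to compare between-chain and within-chain variance is the potential scale reduction factor of Gelman and Rubin. The article does not say which statistic was computed, so the sketch below is an assumption about how such a check could look; the function name is hypothetical and all chains are assumed to have equal length.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Gelman-Rubin style ratio for one scalar parameter.

    chains: list of at least two equal-length lists of sampled values.
    Values near 1 suggest the chains agree; larger values suggest that
    the chains have not yet mixed.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)   # W
    between = n * variance(chain_means)                  # B
    pooled = (n - 1) / n * within + between / n          # pooled variance estimate
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    print(potential_scale_reduction([[0, 1, 2, 3], [10, 11, 12, 13]]))
```

The same function can be applied to subchains of a single long run by passing the consecutive segments as the list of chains.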

Measurement Error in Correlated Predictor Variables 

In this section, measurement error in explanatory variables on Level 1 will be modeled 
by an IRT model for the item responses related to these explanatory variables. Because it is 
not realistic to assume that the predictor variables are independent, a multivariate IRT model 
will be used as measurement error model. The same procedure can be applied to measurement 
errors in correlated explanatory variables on Level 2. It is assumed that there exists a surrogate 
for every unobserved predictor variable and every surrogate consists of a set of item responses. 

Assume that the latent variables $\theta_{qij}$ are related to observable variables $X_{qij}$ 
$(q = 1, \ldots, Q)$ via a normal ogive IRT measurement model. In this case $X_{qij} = 
(X_{qij1}, \ldots, X_{qijK_q})^t$, with realization $(x_{qij1}, \ldots, x_{qijK_q})^t$, denotes a response vector on a test 
with $K_q$ items. Before the actual parameters $\theta$ are identified, consider a parametrization 
$\theta^*$. Let $\theta^*_{ij}$ be the vector of latent predictor variables for a person indexed $ij$, that is, $\theta^*_{ij}$ has 
elements $\theta^*_{qij}$. Further, suppose that for every predictor a two-parameter compensatory normal 
ogive model holds, that is, $P(X_{qijk} = 1 \mid \theta^*_{qij}, a^*_{qk}, b^*_{qk}) = \Phi(a^*_{qk} \theta^*_{qij} - b^*_{qk})$, where $a^*_{qk}$ and 
$b^*_{qk}$ are the item parameters of item $k$ of predictor $q$. Because the predictor variables $\theta^*_{qij}$ are 
considered dependent, it will be assumed that $\theta^*_{ij}$ has a multivariate normal distribution with 
mean zero and covariance matrix $\Sigma^*$. However, the parametrization $\theta^*$ can be transformed 
to a parametrization $\theta$ such that $\theta_{ij}$ has a multivariate normal distribution with mean zero and 
covariance matrix $I$, that is, the variables $\theta_{qij}$ become independent. Under this transformation, 
the normal ogive model transforms to 

$$P(X_{qijk} = 1 \mid \theta_{ij}, a_{qk}, b_{qk}) = \Phi(a_{qk}^t \theta_{ij} - b_{qk}), \tag{14}$$

where $a_{qk}$ is a vector of discrimination parameters, or factor loadings (see, for instance, 
McDonald, 1967, 1982, 1997). Notice that every item response now depends on all latent 
dimensions. This gives rise to the following procedure. 

Analogous to Steps 1 to 3 above, a random vector $Z_{ij} = 
(Z_{1ij1}, \ldots, Z_{QijK_Q})^t$ is introduced, where $Z_{qijk} \sim N(a_{qk}^t \theta_{ij} - b_{qk}, 1)$, and it is supposed that 
$X_{qijk} = 1$ when $Z_{qijk} > 0$ and $X_{qijk} = 0$ otherwise. After deriving the fully conditional 
distributions, the Gibbs sampler can again be used to estimate the posterior distributions of all 
parameters. 

Step 1: Sampling $Z$. Given the parameters $\theta_{ij}$ and $\xi_{qk}$, the variables $Z_{qijk}$ are independent 
and 

$$Z_{qijk} \mid \theta_{ij}, \xi_{qk}, X_{qijk} \sim N(a_{qk}^t \theta_{ij} - b_{qk}, 1), \quad \text{truncated at the left by 0 if } X_{qijk} = 1 \text{ and at the right by 0 if } X_{qijk} = 0.$$

Step 2: Sampling $\theta_{ij}$. Let $\theta_{ij}$ be the vector with $Q$ predictor variables for a person indexed $ij$. 
These are the regression coefficients in the normal linear model 

$$Z_{ij} + b = A \theta_{ij} + \varepsilon_{ij},$$

where $b = (b_{11}, \ldots, b_{1K_1}, b_{21}, \ldots, b_{QK_Q})^t$, $\theta_{ij} = (\theta_{1ij}, \ldots, \theta_{Qij})^t$ and $A$ is a $(\sum_q K_q) \times Q$ 
matrix with row vectors $a_{qk}^t$, concerning items $k = 1, \ldots, K_q$ and predictors $q = 1, \ldots, Q$. 
Furthermore, the vector $\varepsilon_{ij}$ has elements $\varepsilon_{qijk}$, which are independent and standard normally 
distributed. Here, it is assumed that all Level 1 predictors are unobserved and their regression 
coefficients are treated as varying across Level 2 groups. For identification of the model, $\theta_{ij}$ 
has a multivariate standard normal prior, and it follows that 



$$p(\theta_{ij} \mid z_{ij}, y_{ij}, \xi_{qk}, \beta_j, \sigma^2) \propto p(z_{ij} \mid \theta_{ij}, \xi_{qk}) \, p(y_{ij} \mid \theta_{ij}, \beta_j, \sigma^2) \, f(\theta_{ij} \mid 0, I_Q).$$



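As in the unidimensional case, described above, the mixture of multivariate normal 
distributions results in a multivariate normal distribution with a shrinkage estimator as 
expectation, 

$$\theta_{ij} \mid Z_{ij}, Y_{ij}, \xi_{qk}, \beta_j, \sigma^2 \sim N\left( (\Upsilon^{-1} + \Phi^{-1} + I_Q)^{-1} (\Upsilon^{-1} \hat{\theta}_{ij} + \Phi^{-1} \tilde{\theta}_{ij}), \; (\Upsilon^{-1} + \Phi^{-1} + I_Q)^{-1} \right), \tag{15}$$

where $\hat{\theta}_{ij} = (A^t A)^{-1} A^t (Z_{ij} + b)$ and $\tilde{\theta}_{ij} = (\beta_{-j}^t \beta_{-j})^{-1} \beta_{-j} (Y_{ij} - \beta_{0j})$, with $\beta_{-j} = 
(\beta_{1j}, \ldots, \beta_{Qj})^t$. The corresponding variances are $\Upsilon = (A^t A)^{-1}$ and $\Phi = \sigma^2 (\beta_{-j}^t \beta_{-j})^{-1}$; here $\Phi$ denotes a variance and not the normal distribution function. 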
Step 3: Sampling $\xi_{qk}$. Let $\xi_{qk} = (a_{qk}^t, b_{qk})^t$, $k = 1, \ldots, K_q$ and $q = 1, \ldots, Q$, which represent 
the item parameters of item $k$ of a test relating to predictor $q$. Further, define $\theta = (\theta_1, \ldots, \theta_Q)$ 
with $\theta_q = (\theta_{q11}, \ldots, \theta_{qn_J J})^t$. Given $\theta$, the $Z_{qk} = (Z_{q11k}, \ldots, Z_{qn_J Jk})^t$ satisfy the linear model 

$$Z_{qk} = [\theta \;\; -1] \, \xi_{qk} + \varepsilon_{qk}, \tag{16}$$

where $\varepsilon_{qk} = (\varepsilon_{q11k}, \ldots, \varepsilon_{qn_J Jk})^t$ are standard normally distributed. Combining the prior 
$p(\xi_{qk}) \propto \prod_{q=1}^{Q} I(a_{qk} > 0)$ with equation (16) gives 

$$\xi_{qk} \mid \theta, Z_{qk} \sim N\left( \hat{\xi}_{qk}, (H^t H)^{-1} \right) \prod_{q=1}^{Q} I(a_{qk} > 0),$$

where $H = [\theta \;\; -1]$ and $\hat{\xi}_{qk}$ is the least squares estimator based on (16). 

Again, this procedure could be extended to handle observed and unobserved 
explanatory variables with regression coefficients varying or fixed across Level 2 units. Notice 
that the steps described in the previous section for sampling the other parameters of the 
structural model remain the same. Modeling measurement error in the correlated predictor 
variables with the classical true score model demands a lot of prior information. The group 
specific error variance regarding all tests has to be known, that is, the covariance matrix of $Q$ 
explanatory variables of person $ij$ has to be known. The covariance matrix of the correlated 
latent predictor variables also identifies the model, in case of the classical true score model as 
measurement error model. Then, the conditional distribution of $\theta_{ij}$ becomes 

$$\theta_{ij} \mid x_{ij}, y_{ij}, \beta_j, \sigma^2, \Upsilon \sim N\left( (\Upsilon^{-1} + \Phi^{-1})^{-1} (\Upsilon^{-1} x_{ij} + \Phi^{-1} \tilde{\theta}_{ij}), \; (\Upsilon^{-1} + \Phi^{-1})^{-1} \right),$$

where $x_{ij} = (x_{1ij}, \ldots, x_{Qij})^t$ and $x_{qij}$ is the sum-score of person $ij$ on a test related to predictor 
$q$. Further, $\Upsilon$ is the a priori known covariance matrix of the sum-scores of person $ij$. In most 
cases, the covariance matrix is population dependent and fixed over persons taking the tests to 
get a reliable estimate. 

The location of the unobserved predictors can be fixed by transforming each sample 
during the Gibbs sampling process. Grand mean or group-mean centering of an unobserved 
explanatory variable is obtained by subtracting the grand mean or group-means from each 
sample drawn in each step of the Gibbs sampler. 
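
The centering step can be sketched as below. This is an illustration with hypothetical function names; `draws` holds the values of one latent predictor sampled in a single Gibbs iteration and `groups` holds the Level 2 unit of each person.

```python
from statistics import mean


def grand_mean_center(draws):
    """Subtract the overall mean from every sampled value."""
    grand = mean(draws)
    return [d - grand for d in draws]


def group_mean_center(draws, groups):
    """Subtract from every sampled value the mean of its own group."""
    totals, counts = {}, {}
    for d, g in zip(draws, groups):
        totals[g] = totals.get(g, 0.0) + d
        counts[g] = counts.get(g, 0) + 1
    return [d - totals[g] / counts[g] for d, g in zip(draws, groups)]


if __name__ == "__main__":
    print(grand_mean_center([1, 2, 3]))
    print(group_mean_center([1, 3, 10, 14], ["a", "a", "b", "b"]))
```

The transformation is applied anew in every iteration, so the location of the latent scale stays fixed throughout the run.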



A Simulation Study 

In this section, a numerical example is analyzed to illustrate parameter recovery with 
the Gibbs sampler. Data were simulated using a multilevel model with two latent predictors. 
The model is given by 

$$\begin{aligned} y_{ij} &= \beta_{0j} + \beta_{1j} \theta_{ij} + e_{ij}, \\ \beta_{0j} &= \gamma_{00} + \gamma_{01} \zeta_{j} + u_{0j}, \\ \beta_{1j} &= \gamma_{10} + u_{1j}, \end{aligned} \tag{17}$$

where $e_{ij} \sim N(0, \sigma^2)$ and $u_j \sim N(0, T)$. Furthermore, it was assumed that the surrogates $X$ 
and $W$ were related to the latent predictors $\theta$ and $\zeta$ through a normal ogive model. Response 
patterns were generated according to a normal ogive model for tests of 20 items. For the test 
relating to the latent covariate $\theta$ at Level 1, 4,000 response patterns were generated, which were 
divided over $J = 200$ groups of 20 students each. Accordingly, for the test relating to the latent 
covariate $\zeta$ at Level 2, 200 response patterns were generated. The generating values of the item 
parameters are shown under the label Generated in Table 1 and the true values of the fixed and 
random effects, $\gamma$, $\sigma^2$ and $T$, are shown under the label Generated in Table 2. 

The normal ogive models were estimated with the MCMC procedure of Albert (1992) 
with Bilog-MG estimates as starting values. Next, the parameters of the multilevel model 
were sampled, given the parameters of the normal ogive models. In the simulation study, 500 
iterations were needed to estimate the measurement error models and another 500 iterations 
were needed to compute the parameters of the multilevel model. Subsequently, 20,000 
iterations were made to estimate the parameters of the multilevel model with the normal ogive 
model as measurement error model. The convergence of the Gibbs sampler was checked by 
examining the plots of sampled parameter values. It was concluded that a burn-in period 
of 1,000 iterations was sufficient. The model was identified by fixing a discrimination and 
difficulty parameter of both tests to the true values to ensure that $(\theta, \zeta)$ were scaled the same 
way as in the data generation phase. 
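
The data-generating design of the simulation can be sketched as follows. The generating values below are hypothetical placeholders, since the article reports its true values only in Tables 1 and 2, and the two Level 2 residuals are drawn independently here for simplicity, whereas the article allows a full covariance matrix $T$. The function name is also hypothetical.

```python
import random
from statistics import NormalDist


def simulate_design(J=200, n=20, K=20, seed=1):
    """Generate outcomes y, Level 1 responses X and Level 2 responses W."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    # Hypothetical generating values (not the article's).
    g00, g01, g10 = 0.0, 0.1, 0.1
    sigma, tau0, tau1 = 1.0, 0.5, 0.5
    a_x = [rng.uniform(0.5, 1.5) for _ in range(K)]   # discriminations, Level 1 test
    b_x = [rng.uniform(-1.0, 1.0) for _ in range(K)]  # difficulties, Level 1 test
    a_w = [rng.uniform(0.5, 1.5) for _ in range(K)]   # discriminations, Level 2 test
    b_w = [rng.uniform(-1.0, 1.0) for _ in range(K)]  # difficulties, Level 2 test

    def responses(latent, a, b):
        # Normal ogive model: P(item k correct) = Phi(a_k * latent - b_k).
        return [int(rng.random() < cdf(ak * latent - bk)) for ak, bk in zip(a, b)]

    y, X, W, school = [], [], [], []
    for j in range(J):
        zeta = rng.gauss(0.0, 1.0)                    # latent Level 2 predictor
        W.append(responses(zeta, a_w, b_w))
        beta0 = g00 + g01 * zeta + rng.gauss(0.0, tau0)
        beta1 = g10 + rng.gauss(0.0, tau1)
        for _ in range(n):
            theta = rng.gauss(0.0, 1.0)               # latent Level 1 predictor
            X.append(responses(theta, a_x, b_x))
            y.append(beta0 + beta1 * theta + rng.gauss(0.0, sigma))
            school.append(j)
    return y, X, W, school


if __name__ == "__main__":
    y, X, W, school = simulate_design()
    print(len(y), len(X), len(W))
```

With the default arguments this reproduces the sizes described above: 4,000 Level 1 response patterns in 200 groups of 20, and 200 Level 2 response patterns, each on 20 items.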

In Table 1, the estimates of the item parameters from the Gibbs sampler, 
associated with the measurement error model for $\theta$, are given under the label Gibbs Sampler. 
The reported standard deviations are the posterior standard deviations. Credibility intervals 
are calculated as confidence regions for the parameters and are given in the column 
labeled CI. These credibility intervals are the 95% equal-tailed intervals whose endpoints 
are the 2.5 and 97.5 percentiles of the marginal posterior distribution of the parameters. 
The true parameter values are well within the computed credibility intervals, except for the 
discrimination parameter of item 5 and the difficulty parameter of item 14. The estimates of the 
item parameters from the test relating to $\zeta$ are also quite close to the true parameter values 
but have larger standard deviations due to the small number of groups. 
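
An equal-tailed interval of this kind can be computed from the sampled values of a parameter as sketched below. The function names are hypothetical, and linear interpolation between order statistics is one of several percentile conventions; the article does not state which one was used.

```python
def percentile(sorted_draws, p):
    """Percentile by linear interpolation between neighbouring order statistics."""
    position = p * (len(sorted_draws) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_draws) - 1)
    fraction = position - lower
    return sorted_draws[lower] * (1.0 - fraction) + sorted_draws[upper] * fraction


def equal_tailed_interval(draws, level=0.95):
    """Endpoints that cut off (1 - level) / 2 of the draws in each tail."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return percentile(ordered, tail), percentile(ordered, 1.0 - tail)


if __name__ == "__main__":
    print(equal_tailed_interval(list(range(101))))
```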

Table 2 presents the results of estimating the parameters of the multilevel model. It 
is remarkable that the estimate of the variance on Level 2, related to the intercept, and of 
the covariance between the Level 2 residuals are too high in the case where the normal ogive 
model is used as measurement error model. This probably arises from an inaccurate estimate 
of $\zeta$, which may be due to the small number of groups and items in the test. For comparative 
purposes, the unweighted sums of the item responses were rescaled to the same scale as the 
true explanatory variables $(\theta, \zeta)$. The estimates of the fixed and random effects using observed 
scores without measurement error are given under the label Classical True Score Model. It 
can be verified that the estimated parameters obtained using the observed scores, instead of 
the normal ogive model, differ more from the true parameter values. Additionally, only the 
credibility intervals of the parameters $(\gamma_{00}, \gamma_{01}, \tau_1^2)$ contain the true parameter values. 

An Illustrative Example of Measurement Error in Hierarchical Models 

The model was used in an analysis of a mathematics test, from a large scale study 
in which 3713 pupils of grade 4 were followed in 198 regular primary schools (Bosker et 
al., 1999). Among other things, interest was focused on the relation between achievement in 
mathematics and educational provisions at the school level and adaptive instruction by teachers. 
A test measuring the willingness, knowledge and capability to introduce educational program 
changes was taken by teachers. This test, denoted as AI, consisted of 23 dichotomously scored 
items. 

By posing the following Level 1 model, the nested structure of the data was taken into 
account. For each school $j$ $(j = 1, \ldots, J)$, 

$$y_{ij} = \beta_{0j} + \beta_{1j} IQ_{ij} + e_{ij}, \tag{18}$$

where $y_{ij}$ is the score on a mathematics test and $IQ_{ij}$ is an unobserved predictor representing 
the intelligence of a person indexed $ij$. IQ was measured by an intelligence test of 37 items; 
the response patterns of 3713 pupils were available. The $e_{ij}$ are assumed normally distributed 
with mean zero and variance $\sigma^2$. 

First, it was assumed that the intercept was group-dependent and varied randomly from 
school to school. Furthermore, the AI-scores are group level variables that express relevant 
attributes of the schools and are supposed to have an influence on the diversity in mathematics 
scores. Therefore, the variability in $\beta_{0j}$ was modeled as 

$$\begin{aligned} \beta_{0j} &= \gamma_{00} + \gamma_{01} AI_j + u_{0j}, \\ \beta_{1j} &= \gamma_{10}, \end{aligned} \tag{19}$$

where $u_{0j}$ were assumed normally distributed with variance $\tau_0^2$. 

The number of iterations was fixed for each analysis. From examining the plots 
of sampled parameter values, it was concluded that a burn-in period of 500 iterations was 
sufficient. Then an additional 20,000 Gibbs cycles, from which parameters of the posterior 
distribution were estimated, were run. 

Table 3 presents the parameter estimates of model 1, formula (18), where a 
measurement error model was applied to the unobserved explanatory variable representing 
the IQ values of the examinees. The estimated group specific error variance, $\varphi$, was .39. 
Notice that this estimate for the group specific error variance was obtained by averaging the 
unbiased estimates for the error variances of individual examinees (Lord & Novick, 1968, 
p. 155). For the moment, the mean observed score from the AI test was used, neglecting 
its error component. Further, the model was estimated neglecting both error components of 
the predictors, $\varphi = 0$. The main result of the analysis is that, conditionally on IQ, adaptive 
instruction for teachers seems to have a small positive effect on mathematics achievements of 
students, but this effect does not differ significantly from zero. Furthermore, individuals with 
high IQ values score high on the mathematics test. The use of a multilevel model was justified, 
because a substantial proportion of the variation in the outcome at the student level was between 
schools. This is the variance of the achievements of students in school $j$ controlling for IQ, 
around the grand mean, $\gamma_{00}$, which does not differ significantly from zero. 

There are some important differences between the parameter estimates from the 
multilevel model with the normal ogive model and with the classical true score model, with $\varphi = .39$ 
and $\varphi = 0$, as measurement error model, denoted by $M_1$, $M_{c1}$ and $M_{c2}$, respectively. The 
parameter estimates in Table 3 are not comparable because the IQ predictors in the various 
models are differently scaled. A better way to compare the models is by looking at the posterior 
predictive data, $Y^{rep}$, under the different models (Carlin & Louis, 1996; Gelman et al., 1995, 
1996). Let $Y^{rep}$ denote a future observation, independent of $Y$ given the underlying model 
parameters. Define $L_j^2$ as the distance from $Y_j^{rep}$ to $Y_j$ given model $M$ and data $(X_j, W_j)$, with 
expectation $E[L_j^2 \mid M, y_j]$. Aggregating over schools results in 

$$E[L^2 \mid M, y] = E[(y - y^{rep})^2 \mid M, y] = \prod_j \int E[L_j^2 \mid M, y_j] \, p(\beta_j \mid \zeta_j, y_j) \, p(\zeta_j \mid w_j, y_j) \, d\beta_j \, d\zeta_j, \tag{20}$$

where the expectation $E[L_j^2 \mid M, y_j]$ is taken with respect to $p(y_j^{rep} \mid \theta_j, \beta_j, \sigma^2)$, the probability 
of replicating data $y_j^{rep}$ given the underlying parameters, and $p(\theta_j, \sigma^2 \mid x_j, y_j)$, the joint posterior 
density of the unobserved explanatory variables $\theta$ and variance $\sigma^2$ at Level 1, and where 
$p(\zeta_j \mid w_j, y_j)$ is the posterior density of $\zeta$ at Level 2 given the observed data. This statistic 
summarizes the information concerning the predictive data given the observed data. Besides, it is 
the sum of the variance of the replicated data plus the square of the bias of the replicated data 
with respect to the observed data. Notice that replications of the predictive data are independent 
of the scale of the predictors in the various models. This predictive criterion is based on the 
quality of prediction of a replicate of the observed data. In examining a collection of models, 
predictive distributions will be comparable. Further, it is natural to evaluate model performance 
by comparing what the model predicts with what has been observed (Bernardo & Smith, 1994, p. 397). 
If the model fits, replicated data under the model should look similar to the observed data, which 
means that $E[L^2 \mid M, y]$ should be small. Large values of this statistic indicate that replicated 
data under the model differ from the observed data, and the model does not fit the data. 
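
The predictive criterion and its split into variance plus squared bias can be sketched numerically as follows. This is an illustration with hypothetical function names. It assumes that each replicated data set is generated from one posterior draw of the fitted means and of $\sigma$, and it sums squared distances over observations; any further scaling used in the article is not reproduced here.

```python
import random
from statistics import mean, pvariance


def replicate(fitted_means, sigma, rng):
    """One replicated data set: y_rep_i ~ N(fitted_mean_i, sigma^2)."""
    return [rng.gauss(m, sigma) for m in fitted_means]


def expected_squared_distance(y_obs, replicates):
    """Average over replicates of the summed squared distance to y_obs."""
    return mean(
        sum((y - r) ** 2 for y, r in zip(y_obs, rep)) for rep in replicates
    )


def variance_plus_squared_bias(y_obs, replicates):
    """Same quantity, computed per observation as variance + bias^2."""
    total = 0.0
    for i, y in enumerate(y_obs):
        reps_i = [rep[i] for rep in replicates]
        total += pvariance(reps_i) + (mean(reps_i) - y) ** 2
    return total


if __name__ == "__main__":
    y_obs = [1.0, 0.0]
    reps = [[0.0, 1.0], [2.0, 1.0], [4.0, -2.0]]
    print(expected_squared_distance(y_obs, reps))
    print(variance_plus_squared_bias(y_obs, reps))
```

Both functions return the same number for any set of replicates, which is the decomposition referred to in the text: a model can lower the criterion by reducing bias even if the replicated data become more variable.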

Table 3 presents the $E[L^2]$ and corresponding standard deviations for the various 
models. Model $M_1$, with an IRT measurement error model, performs better than model $M_{c2}$, 
which ignores measurement error in both predictor variables. In fact, model $M_{c2}$ is the standard 
hierarchical linear model treating the AI and IQ variables as observed. So, using an IRT 
measurement error model results in a better model fit in terms of minimization of $E[L^2]$. Model 
$M_{c1}$, with a classical true score model and prior knowledge $\varphi = 0.39$, performs better than 
model $M_1$. That is, the classical true score model increases the variability of the predictors and 
reduces the biases caused by the measurement error in a more effective way than the normal 
ogive model. 

It is interesting at this point to see what happens if a measurement error model is used 
on Level 2. So the response variance of the AI test is modeled using (19). Table 4 presents the 
parameter estimates of the multilevel model with measurement error in the predictor variables 
on Level 1, IQ, and Level 2, AI. The model labeled $M_2$ models both unobserved predictors 
with a normal ogive model. Model $M_{c3}$ contains the classical true score model as measurement 
error model for both predictors, with $\varphi_1 = .39$ and $\varphi_2 = .43$ as the estimated response 
variance for the IQ and AI test, respectively. The results from both models show that adaptive 
instruction for teachers still has no significant effect on the mathematics achievements of 
students. Further, students with high IQ scores still perform better than students with lower 
scores. The proportion of variance in mathematics scores accounted for by group-membership, 
controlling for IQ scores, is .291 using model $M_2$ and .396 using model $M_{c3}$. This indicates a 
substantial difference between both models. 
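
A proportion of this kind is commonly computed as a residual intraclass correlation. The expression below is a sketch under the assumption that only the intercept varies across schools, as in formula (19); the symbol $\rho$ is introduced here for illustration, and the article does not state the formula it used.

```latex
\rho = \frac{\tau_0^2}{\tau_0^2 + \sigma^2}
```

Here $\tau_0^2$ is the Level 2 variance of the intercept and $\sigma^2$ the Level 1 residual variance, both conditional on IQ.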

Model $M_{c3}$ considers response variance in all predictors. This results in better 
replications of the data with respect to $E[L^2]$. As before, the variability in the predictors 
induces larger variances of the parameter estimates and decreases the distance between the 
replicated data and the observed data. It can be seen that correcting for bias results in more 
variable estimates but also in a better prediction of the data. Model $M_2$ has no benefit from the 
normal ogive model as measurement error model on Level 2; the $E[L^2]$ stabilizes with respect 
to model $M_1$. The small number of responses, 20 items with 198 respondents, may strongly affect 
this result. More respondents taking the AI test may lead to a better result with respect to using 
the normal ogive model as measurement error model on Level 2. Here it can be concluded that 
correcting for measurement error with the classical true score model on both levels resulted 
in more variance of the parameter estimates but less bias, and this is beneficial in terms of the 
predictive criterion given by formula (20). In general, the use of a measurement error model 
led to a reduction in bias and variance of the replicated data in relation to the observed data in 
all cases. 

It seems that varying the measurement error variance, $\varphi$, leads to the conclusion that more 
variance results in better predictions with respect to the observed data. However, there is a 
turning point where additional prior variance, $\varphi$, leads to a higher value of $E[L^2]$. Figure 1 
displays the $E[L^2]$ for various values of the error variance in the predictor variables on Level 
1 and Level 2. It can be seen that the value of $E[L^2]$ is above 1.5 when the variance in the 
predictor variable, IQ, on Level 1 is low. For various values of error variance in AI this 
statistic decreases to .4 when the error variance in IQ is between .1 and .4 and it goes up to 2 
when the variance in IQ rises to 1. For some error variances in the Level 2 predictor the $E[L^2]$ 
stays below .5 for all error variances below 1 in the Level 1 covariate. Generally, the prior 
information about the group specific error variance highly influences the results. 

Discussion 

In this article, a normal ogive model is imposed on the unobserved explanatory 
variables in a multilevel model. In the social sciences, it is rarely possible to measure all 
relevant covariates directly and accurately. Correcting for measurement error is dependent on 
knowledge of the measurement error process. Here, the normal ogive model describes the link 
between the observed data and the unobserved variables. This is compared with the classical 
true score model as measurement error model. To strengthen the relevancy of the chosen 
measurement error model the effects of measurement error are determined by the measurement 
error distribution. Appropriate methods for correcting for the effects of measurement error 
depend on the measurement error distribution (Carroll et al., 1995). It is shown that both 
measurement error models reduce the bias in the estimates with an increase of the variance. 
This bias versus variance trade-off works well in both cases. Better results are obtained with 
the more flexible classical true score model in terms of the expected square distance between the 
observed and predicted data. But for a realistic way of modeling it requires information about 
the group specific error variance. The classical true score model depends highly on this prior 
information. This leads to a certain degree of arbitrariness. Moreover, the variance structure 
of the errors in the predictor variables is difficult to estimate. Therefore, it can be said that the 
alternative, the normal ogive model, is more conservative in terms of the used statistic, but it 
encompasses a more realistic way of modeling measurement error in the predictor variables, 
because it does not depend on any arbitrary assumption on the error variance structure. 

An important point is the flexibility of the proposed estimation procedure. This 
enables modeling of complicated measurement error models without artificial simplifying 
assumptions. Prior knowledge is easily incorporated, which ensures a more realistic way 
of modeling measurement error. Further, it is possible to model unobserved compositional 
variables at Level 2, that is, a measurement aggregated over the characteristics of the Level 1 
units within Level 2 units. An example is the mean intake achievement of all the pupils in a 
school. 

It is possible to use other IRT models as a measurement error model. For example, 
the three-parameter item response model and polytomously scored items can be estimated 
within the Bayesian framework using the Gibbs sampler (Beguin, 2000; Johnson & Albert, 
1999). If the conditional distribution of some parameters is difficult to sample from, then 
a Metropolis-Hastings step within the Gibbs sampler can be used to obtain samples from the 
posterior distribution of the specific parameters (Chib & Greenberg, 1995). The test statistic 
discussed above only focuses on the extent to which the observed data are reproduced by the 
model. Other posterior predictive checks can be developed to judge the fit and assumptions 
of the model with measurement error in the covariates, such as local independence and 
homoscedasticity, but this is beyond the scope of the present article. 

In the present article, the response variable, Y, is treated as observed without 
measurement error. It is possible to extend the procedure and to model this variable also with an 
IRT model. This more complex problem, where both the response and some of the predictors 
are measured with error, deserves further research. The basic structure of this more complex 
model is related to the multilevel IRT model (Fox & Glas, 2000) or the generic hierarchical 
IRT model (Patz & Junker, 1999) with background variables measured with error. This 
whole framework is also strongly related to the framework of structural equation modeling, 
where there is a measurement part and a structural part. The measurement part of the model 
consists of the response variable and observed predictor surrogates and latent variables, and 
the structural part is defined in terms of the latent variables regressed on each other and some 
observed background variables. In MIMIC modeling (see, for example, Bollen, 1989; Muthén, 
1989), one or more latent variables intervene between the observed background variables 
predicting a set of observed response variables and surrogates. The main difference between 
these approaches and the one presented here is the use of an IRT model as a measurement error 
model, and integration of these various approaches remains a point of further study. 

Table 1. Item parameter estimates of the normal ogive IRT model in measuring $\theta$. 

Table 2. Parameter estimates of the multilevel model with measurement error in the covariates. 

Table 3. Parameter estimates of the multilevel model with the normal ogive and the classical 
true score model as measurement error models. 

Table 4. Parameter estimates of the multilevel model with the normal ogive and the classical 
true score model as measurement error models on Level 1 and Level 